IDS Lab Project: DENGUE CASES ANALYSIS IN PAKISTAN¶
Group Members:
Anousha Gul
Muhammad Ahmed
Subject: Lab Project
Instructor: Sir Jawad
Step 1 — Import Libraries¶
Start with the core imports used throughout the workflow.
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
mean_squared_error, r2_score,
accuracy_score, confusion_matrix, classification_report
)
sns.set(style="whitegrid")
2 — Load Dataset¶
df = pd.read_csv('DENGUE__Pakistan.csv')
print("Dataset loaded. Shape:", df.shape)
df.head()
Dataset loaded. Shape: (10240, 11)
| | date | year | month | province | district | suspected_cases | confirmed_cases | deaths | temperature | rainfall | humidity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/1/2016 | 2016 | 1.0 | Punjab | Lahore | 5.0 | 1.0 | 0.0 | 24.6 | 1.4 | 40.8 |
| 1 | 1/1/2016 | 2016 | 1.0 | Punjab | Rawalpindi | 2.0 | 0.0 | 0.0 | 23.4 | 1.6 | 43.5 |
| 2 | 1/1/2016 | 2016 | 1.0 | Punjab | Multan | 4.0 | 1.0 | 0.0 | 25.2 | 2.7 | 43.2 |
| 3 | 1/1/2016 | 2016 | 1.0 | Sindh | Karachi | 4.0 | 3.0 | 0.0 | 27.1 | 0.1 | NaN |
| 4 | 1/1/2016 | NaN | 1.0 | Sindh | Hyderabad | 10.0 | 6.0 | 0.0 | 26.4 | 4.0 | 40.3 |
print(df.columns)
Index(['date', 'year', 'month', 'province', 'district', 'suspected_cases',
'confirmed_cases', 'deaths', 'temperature', 'rainfall', 'humidity'],
dtype='object')
Fix date column ("#####" problem): parse dates and drop rows that fail to parse
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df.dropna(subset=['date'])
Convert ALL columns
Step 1: Convert numeric columns to numeric type and fill missing values
numeric_columns = ['suspected_cases', 'confirmed_cases', 'deaths',
'temperature', 'rainfall', 'humidity', 'year', 'month']
for col in numeric_columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
    df[col] = df[col].fillna(df[col].median())
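Filling a missing `year` or `month` with the column median can produce a value that contradicts the row's own `date` (row 4 in the preview above has a missing `year` next to the date 1/1/2016). A minimal sketch, assuming the `date` column has already been parsed, fills both fields from the date instead. The three-row frame `toy` is made-up illustration data, not the project dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the second row lacks its year, the third its month.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2019-07-15", "2021-11-30"]),
    "year": [2016.0, np.nan, 2021.0],
    "month": [1.0, 7.0, np.nan],
})

# Fill gaps from the parsed date rather than from the column median.
toy["year"] = toy["year"].fillna(toy["date"].dt.year)
toy["month"] = toy["month"].fillna(toy["date"].dt.month)

print(toy)
```

Values that are already present are left untouched; only the gaps are taken from the date.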
Step 2: Clean text columns (like 'province', 'district')
text_columns = ['province', 'district']
for col in text_columns:
    df[col] = df[col].astype(str).str.strip()
    df[col] = df[col].replace({'nan': None, 'NaN': None})
Step 3: Drop rows that still have missing values after cleaning
df = df.dropna()
df.isnull().sum()
date               0
year               0
month              0
province           0
district           0
suspected_cases    0
confirmed_cases    0
deaths             0
temperature        0
rainfall           0
humidity           0
dtype: int64
Fix column names (clean)
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df['province'] = df['province'].str.lower().str.capitalize()
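`str.capitalize()` upper-cases only the first character of the whole string, which is why later output shows `Khyber pakhtunkhwa` while the hand-built prediction frames spell it `Khyber Pakhtunkhwa`. The pandas `.str` methods follow the built-in Python string methods, so the difference can be checked with plain strings. The input string below is illustrative, and switching to `.str.title()` is offered as one possible fix, not as what the notebook ran.

```python
raw = "KHYBER PAKHTUNKHWA"  # illustrative input, not taken from the dataset

capitalized = raw.lower().capitalize()  # only the first letter is upper-cased
titled = raw.lower().title()            # the first letter of every word is upper-cased

print(capitalized)  # Khyber pakhtunkhwa
print(titled)       # Khyber Pakhtunkhwa
```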
Save cleaned dataset
df.to_csv("cleaned2_dengue_data.csv", index=False)
print("Cleaned dataset saved. Shape:", df.shape)
Cleaned dataset saved. Shape: (10237, 11)
HOW MANY TIMES EACH VALUE APPEARS IN EACH COLUMN
print("Suspected Cases Frequency:\n", df['suspected_cases'].value_counts())
Suspected Cases Frequency:
suspected_cases
4.0     1483
5.0     1475
6.0     1245
3.0     1139
7.0      989
8.0      766
2.0      761
9.0      552
10.0     396
1.0      388
11.0     301
12.0     191
13.0     134
14.0     106
0.0       72
15.0      72
16.0      33
18.0      20
17.0      19
19.0      15
20.0      15
21.0       9
33.0       4
22.0       4
28.0       4
43.0       4
27.0       3
46.0       3
26.0       3
25.0       2
53.0       2
42.0       2
60.0       2
59.0       2
37.0       2
40.0       2
31.0       1
63.0       1
48.0       1
23.0       1
49.0       1
74.0       1
45.0       1
35.0       1
34.0       1
29.0       1
89.0       1
72.0       1
66.0       1
80.0       1
83.0       1
92.0       1
44.0       1
Name: count, dtype: int64
print("Confirmed Cases Frequency:\n", df['confirmed_cases'].value_counts())
Confirmed Cases Frequency:
confirmed_cases
2.0     2215
1.0     1982
3.0     1941
4.0     1346
0.0      881
5.0      808
6.0      513
7.0      240
8.0      132
9.0       62
10.0      26
11.0      19
12.0      12
13.0       7
22.0       5
16.0       4
15.0       4
31.0       4
29.0       4
27.0       4
17.0       3
28.0       2
20.0       2
19.0       2
14.0       2
47.0       2
24.0       1
41.0       1
40.0       1
35.0       1
44.0       1
52.0       1
33.0       1
60.0       1
38.0       1
45.0       1
26.0       1
18.0       1
56.0       1
55.0       1
36.0       1
Name: count, dtype: int64
print("Deaths Frequency:\n", df['deaths'].value_counts())
Deaths Frequency:
deaths
0.0    9951
1.0     273
2.0      11
3.0       2
Name: count, dtype: int64
print("Temperature Frequency:\n", df['temperature'].value_counts())
Temperature Frequency:
temperature
22.5 92
23.8 84
20.8 83
22.9 80
23.0 79
..
7.8 1
9.6 1
36.2 1
36.4 1
8.8 1
Name: count, Length: 279, dtype: int64
print("Rainfall Frequency:\n", df['rainfall'].value_counts())
Rainfall Frequency:
rainfall
0.8 226
0.7 218
0.9 214
0.6 213
1.3 210
...
306.5 1
83.6 1
84.2 1
61.8 1
82.0 1
Name: count, Length: 976, dtype: int64
print(df.columns)
Index(['date', 'year', 'month', 'province', 'district', 'suspected_cases',
'confirmed_cases', 'deaths', 'temperature', 'rainfall', 'humidity'],
dtype='object')
df.iloc[20:35, : ]
| | date | year | month | province | district | suspected_cases | confirmed_cases | deaths | temperature | rainfall | humidity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 2016-01-03 | 2016.0 | 1.0 | Balochistan | Quetta | 8.0 | 5.0 | 0.0 | 16.9 | 1.3 | 48.0 |
| 21 | 2016-01-04 | 2016.0 | 1.0 | Punjab | Lahore | 8.0 | 4.0 | 0.0 | 22.5 | 4.9 | 41.6 |
| 22 | 2016-01-04 | 2016.0 | 1.0 | Punjab | Rawalpindi | 7.0 | 5.0 | 0.0 | 21.5 | 2.5 | 60.8 |
| 23 | 2016-01-04 | 2016.0 | 1.0 | Punjab | Multan | 5.0 | 3.0 | 0.0 | 26.2 | 1.8 | 37.3 |
| 24 | 2016-01-04 | 2016.0 | 1.0 | Sindh | Karachi | 4.0 | 4.0 | 0.0 | 27.6 | 4.8 | 49.0 |
| 25 | 2016-01-04 | 2016.0 | 1.0 | Sindh | Hyderabad | 0.0 | 0.0 | 0.0 | 27.8 | 0.7 | 38.9 |
| 26 | 2016-01-04 | 2016.0 | 1.0 | Khyber pakhtunkhwa | Peshawar | 1.0 | 1.0 | 0.0 | 21.6 | 2.3 | 51.9 |
| 27 | 2016-01-04 | 2016.0 | 1.0 | Balochistan | Quetta | 3.0 | 1.0 | 0.0 | 17.1 | 1.2 | 48.0 |
| 28 | 2016-01-05 | 2016.0 | 1.0 | Punjab | Lahore | 8.0 | 4.0 | 0.0 | 24.8 | 1.4 | 43.3 |
| 29 | 2016-01-05 | 2016.0 | 1.0 | Punjab | Rawalpindi | 6.0 | 4.0 | 0.0 | 23.8 | 3.8 | 46.5 |
| 30 | 2016-01-05 | 2016.0 | 1.0 | Punjab | Multan | 6.0 | 2.0 | 0.0 | 25.6 | 2.1 | 42.8 |
| 31 | 2016-01-05 | 2016.0 | 1.0 | Sindh | Karachi | 11.0 | 5.0 | 0.0 | 27.7 | 1.3 | 47.0 |
| 32 | 2016-01-05 | 2016.0 | 1.0 | Sindh | Hyderabad | 5.0 | 5.0 | 0.0 | 27.9 | 3.2 | 48.7 |
| 33 | 2016-01-05 | 2016.0 | 1.0 | Khyber pakhtunkhwa | Peshawar | 3.0 | 0.0 | 0.0 | 22.5 | 1.4 | 39.8 |
| 34 | 2016-01-05 | 2016.0 | 1.0 | Balochistan | Quetta | 5.0 | 2.0 | 0.0 | 13.6 | 4.3 | 40.9 |
4 — Exploratory Data Analysis¶
Distribution of Suspected Cases
plt.figure(figsize=(10,5))
sns.histplot(df['suspected_cases'], bins=8, kde=True, color='salmon')
plt.title('Distribution of Suspected Dengue Cases', fontsize=14)
plt.xlabel('Suspected Cases')
plt.ylabel('Frequency')
plt.show()
Percentage of Deaths by Province
# Sum recorded deaths per province from the cleaned dataset
# (replaces hard-coded placeholder values that did not come from the data)
death_data = df.groupby('province', as_index=False)['deaths'].sum()
plt.figure(figsize=(8,8))
colors = sns.color_palette('Set2')
plt.pie(death_data['deaths'], labels=death_data['province'], autopct='%1.1f%%', startangle=140, colors=colors)
plt.title('Percentage of Deaths by Province', fontsize=14)
plt.axis('equal')
plt.show()
Confirmed Cases vs Province
plt.figure(figsize=(10,5))
sns.boxplot(data=df, x='province', y='confirmed_cases', hue='province', palette='Pastel1', dodge=False, legend=False)
plt.title('Confirmed Dengue Cases by Province', fontsize=14)
plt.xlabel('Province')
plt.ylabel('Confirmed Cases')
plt.show()
A_Monthly confirmed cases
df["year_month"] = df["date"].dt.to_period("M").astype(str)
monthly = df.groupby("year_month")["confirmed_cases"].sum().reset_index()
plt.figure(figsize=(12,4))
plt.plot(monthly["year_month"], monthly["confirmed_cases"], marker="o")
plt.xticks(monthly.index[::3], monthly["year_month"][::3], rotation=45)
plt.title("Monthly Confirmed Dengue Cases")
plt.xlabel("Year-Month")
plt.ylabel("Confirmed Cases")
plt.tight_layout()
plt.show()
B_ Monthly trend of suspected and confirmed cases, by year
df['month'] = df['month'].astype(int)
df['year'] = df['year'].astype(int)
monthly_cases = df.groupby(['year','month'])[['suspected_cases','confirmed_cases']].sum().reset_index()
pivot_suspected = monthly_cases.pivot(index='month', columns='year', values='suspected_cases')
pivot_confirmed = monthly_cases.pivot(index='month', columns='year', values='confirmed_cases')
plt.figure(figsize=(15,8))
for year in pivot_suspected.columns:
    plt.plot(pivot_suspected.index,
             pivot_suspected[year],
             marker='o',
             label=f'Suspected {year}')
for year in pivot_confirmed.columns:
    plt.plot(pivot_confirmed.index,
             pivot_confirmed[year],
             marker='x',
             linestyle='--',
             label=f'Confirmed {year}')
plt.title("Monthly Trend Of Dengue Cases For All Years")
plt.xlabel("month")
plt.ylabel("Number of cases")
plt.xticks(range(1, 13))
plt.legend()
plt.grid(True)
plt.show()
C_ Province-wise confirmed cases
province_wise = df.groupby("province")["confirmed_cases"].sum().reset_index()
province_wise = province_wise.sort_values(by="confirmed_cases", ascending=False)
plt.figure(figsize=(12,6))
plt.bar(province_wise["province"], province_wise["confirmed_cases"], color="black")
plt.xticks(rotation=45)
plt.title("Confirmed Dengue Cases by Province")
plt.xlabel("Province")
plt.ylabel("Confirmed Cases")
plt.tight_layout()
plt.show()
Correlation heatmap
corr = df[["suspected_cases","confirmed_cases","humidity","temperature","rainfall","deaths"]].corr()
plt.figure(figsize=(10,8))
sns.heatmap(
corr,
annot=True,
fmt=".2f",
cmap="YlGnBu",
linewidths=0.5,
linecolor='white',
cbar_kws={"shrink":0.8},
annot_kws={"size":10, "weight":"bold"}
)
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.yticks(rotation=0, fontsize=10)
plt.title("Correlation Between Numerical Features", fontsize=14, weight="bold")
plt.tight_layout()
plt.show()
5 — Regression Models (Train & Predict)¶
1_Split for regression
X = df[['suspected_cases', 'deaths', 'temperature', 'rainfall', 'humidity', 'year', 'month']]
y_reg = df['confirmed_cases']
2_Train-test split
X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(
X, y_reg, test_size=0.3, random_state=42
)
3_Initialize and train model
linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train_lr, y_train_lr)
LinearRegression()
4_Predict on test set
predictions_linear = linear_regression_model.predict(X_test_lr)
5_Evaluate
print("Linear Regression MSE:", mean_squared_error(y_test_lr, predictions_linear))
print("Linear Regression R²:", r2_score(y_test_lr, predictions_linear))
Linear Regression MSE: 1.1304187765144633
Linear Regression R²: 0.8280726113384165
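For reference, the two metrics printed above follow the standard definitions: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. An MSE of about 1.13 corresponds to a root-mean-square error of roughly 1.06 confirmed cases. The pure-Python sketch below applies both definitions to made-up numbers (not the project data); the helper names `mse` and `r2` are illustrative.

```python
def mse(y_true, y_pred):
    """Mean of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.5, 2.0, 2.5, 4.0]

mse_demo = mse(y_true_demo, y_pred_demo)
r2_demo = r2(y_true_demo, y_pred_demo)
print(mse_demo, r2_demo)
```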
Random Forest Regression¶
1 Split data (can use same X and y)
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(
X, y_reg, test_size=0.3, random_state=42
)
2 Create and train Random Forest Regressor
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest_model.fit(X_train_rf, y_train_rf)
RandomForestRegressor(random_state=42)
3 Make predictions
predictions_random_forest = random_forest_model.predict(X_test_rf)
4 Evaluate the model
print("Random Forest Regression MSE:", mean_squared_error(y_test_rf, predictions_random_forest))
print("Random Forest Regression R²:", r2_score(y_test_rf, predictions_random_forest))
Random Forest Regression MSE: 1.1647791341145834
Random Forest Regression R²: 0.8228466838517181
6 — Linear Regression Predictions & Visualization¶
New environmental conditions for 2026
new_data_lr = pd.DataFrame({
'province': ['Punjab', 'Sindh', 'Khyber Pakhtunkhwa', 'Balochistan'],
'suspected_cases': [10, 20, 15, 8],
'deaths': [1, 2, 1, 1],
'temperature': [25, 30, 28, 32],
'rainfall': [5, 10, 7, 4],
'humidity': [50, 60, 55, 45],
'year': [2026, 2026, 2026, 2026],
'month': [6, 7, 8, 9]
})
Predict using the trained model
pred_linear = linear_regression_model.predict(new_data_lr.drop(columns='province'))
Combine predictions with the input for clarity
new_data_lr['Pred_Linear'] = pred_linear
Find the province with highest expected cases
highest_linear_idx = new_data_lr['Pred_Linear'].idxmax()
print("Linear Regression Predictions for 2026:")
print(new_data_lr)
print("Highest expected cases by Linear Regression:")
print(new_data_lr.loc[highest_linear_idx])
Linear Regression Predictions for 2026:
province suspected_cases deaths temperature rainfall \
0 Punjab 10 1 25 5
1 Sindh 20 2 30 10
2 Khyber Pakhtunkhwa 15 1 28 7
3 Balochistan 8 1 32 4
humidity year month Pred_Linear
0 50 2026 6 6.238485
1 60 2026 7 12.988152
2 55 2026 8 9.138210
3 45 2026 9 5.274889
Highest expected cases by Linear Regression:
province Sindh
suspected_cases 20
deaths 2
temperature 30
rainfall 10
humidity 60
year 2026
month 7
Pred_Linear 12.988152
Name: 1, dtype: object
VISUALIZATION
plt.figure(figsize=(8,6))
sns.barplot(data=new_data_lr, x='province', y='Pred_Linear', color='salmon')
plt.text(highest_linear_idx, new_data_lr.loc[highest_linear_idx, 'Pred_Linear'] + 0.5,
f"Highest: {new_data_lr.loc[highest_linear_idx, 'Pred_Linear']:.1f}",
color='red', ha='center', fontweight='bold')
plt.title('Linear Regression Predicted Cases for 2026', fontsize=14)
plt.xlabel('Province')
plt.ylabel('Predicted Cases')
plt.show()
Random Forest Regression Predictions & Visualization¶
1 New data for 2026 (copy to keep separate)
new_data_rf = new_data_lr.copy()
2 Predict using Random Forest
pred_rf = random_forest_model.predict(new_data_rf[['suspected_cases', 'deaths', 'temperature', 'rainfall', 'humidity', 'year', 'month']])
new_data_rf['Pred_RandomForest'] = pred_rf
3 Find highest predicted case
highest_rf_idx = new_data_rf['Pred_RandomForest'].idxmax()
4 Show predictions
print("Random Forest Regression Predictions for 2026:")
print(new_data_rf)
print("Highest expected cases by Random Forest Regression:")
print(new_data_rf.loc[highest_rf_idx])
Random Forest Regression Predictions for 2026:
province suspected_cases deaths temperature rainfall \
0 Punjab 10 1 25 5
1 Sindh 20 2 30 10
2 Khyber Pakhtunkhwa 15 1 28 7
3 Balochistan 8 1 32 4
humidity year month Pred_Linear Pred_RandomForest
0 50 2026 6 6.238485 5.66
1 60 2026 7 12.988152 10.54
2 55 2026 8 9.138210 7.96
3 45 2026 9 5.274889 4.33
Highest expected cases by Random Forest Regression:
province Sindh
suspected_cases 20
deaths 2
temperature 30
rainfall 10
humidity 60
year 2026
month 7
Pred_Linear 12.988152
Pred_RandomForest 10.54
Name: 1, dtype: object
# Visualization
plt.figure(figsize=(8,6))
sns.barplot(data=new_data_rf, x='province', y='Pred_RandomForest', color='skyblue')
plt.text(highest_rf_idx, new_data_rf.loc[highest_rf_idx, 'Pred_RandomForest'] + 0.5,
f"Highest: {new_data_rf.loc[highest_rf_idx, 'Pred_RandomForest']:.1f}",
color='red', ha='center', fontweight='bold')
plt.title('Random Forest Predicted Cases for 2026', fontsize=14)
plt.xlabel('Province')
plt.ylabel('Predicted Cases')
plt.show()
7 — Classification Models (Train & Predict)¶
Step 1 — Prepare dataset for classification¶
Define classification target (High dengue cases = 1 if confirmed_cases > threshold)
threshold = 10
y_clf = (df['confirmed_cases'] > threshold).astype(int)
Features
X = df[['suspected_cases', 'deaths', 'temperature', 'rainfall', 'humidity', 'year', 'month']]
Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y_clf, test_size=0.3, random_state=42
)
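Fitting `StandardScaler` on all rows before splitting lets the test rows influence the mean and standard deviation used for scaling. With a dataset of this size the effect is probably small, but a common alternative is to split first and compute the scaling statistics from the training rows only. The sketch below shows the idea with NumPy on random made-up data; `X_demo`, `train_mean` and `train_std` are illustrative names, and `np.std` with its default `ddof=0` matches the population standard deviation that `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=25.0, scale=5.0, size=(100, 3))  # made-up features

# Split first (70/30, mirroring test_size=0.3 in the notebook).
shuffled = rng.permutation(len(X_demo))
train_idx, test_idx = shuffled[:70], shuffled[70:]
X_tr, X_te = X_demo[train_idx], X_demo[test_idx]

# Scaling statistics come from the training rows only.
train_mean = X_tr.mean(axis=0)
train_std = X_tr.std(axis=0)

X_tr_scaled = (X_tr - train_mean) / train_std
X_te_scaled = (X_te - train_mean) / train_std  # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero by construction
print(X_te_scaled.mean(axis=0).round(3))  # generally close to, but not exactly, zero
```

With scikit-learn the same pattern is `scaler.fit_transform(...)` on the training split followed by `scaler.transform(...)` on the test split.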
Step 2 — KNN Classifier¶
Initialize and train KNN
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)
KNeighborsClassifier()
Predict on test set
pred_knn = knn_classifier.predict(X_test)
Accuracy
print("KNN Accuracy:", accuracy_score(y_test, pred_knn))
KNN Accuracy: 0.9967447916666666
Step 3 — Random Forest Classifier¶
Initialize and train Random Forest
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
RandomForestClassifier(random_state=42)
Predict on test set
pred_rf_clf = rf_classifier.predict(X_test)  # separate name so the regression predictions in pred_rf are not overwritten
Accuracy
print("Random Forest Accuracy:", accuracy_score(y_test, pred_rf_clf))
Random Forest Accuracy: 0.9973958333333334
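Accuracy close to 1.0 should be read against the class balance. Summing the `confirmed_cases` frequency table printed earlier, only 91 of the 10,237 cleaned rows exceed the threshold of 10, so a model that always predicts 0 would already be about 99.1% accurate on the full dataset (the share in the 30% test split may differ slightly). The sketch below reproduces that baseline and then uses a confusion matrix to show how recall for the rare class can tell a different story. The four confusion-matrix counts are hypothetical, chosen only for illustration; they are not results from this notebook.

```python
# Counts taken from the confirmed_cases frequency table in this notebook.
total_rows = 10237
high_rows = 91  # rows with confirmed_cases > 10

baseline_accuracy = (total_rows - high_rows) / total_rows
print(f"Always-predict-0 accuracy: {baseline_accuracy:.4f}")

# HYPOTHETICAL confusion matrix for a 3,072-row test set (illustration only).
tn, fp, fn, tp = 3043, 2, 8, 19

accuracy = (tp + tn) / (tn + fp + fn + tp)
recall = tp / (tp + fn)        # share of true high-case rows that were found
precision = tp / (tp + fp)     # share of predicted high-case rows that were right

print(f"accuracy={accuracy:.4f} recall={recall:.3f} precision={precision:.3f}")
```

The real counts can be obtained with the already imported helpers, for example `confusion_matrix(y_test, pred_knn)` and `classification_report(y_test, pred_knn)`.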
Step 4 — Predict new cases for 2025¶
New data for 2025 (example values for each province)
new_data_2025 = pd.DataFrame({
'suspected_cases': [12, 18, 15, 10],
'deaths': [1, 2, 1, 1],
'temperature': [26, 30, 28, 25],
'rainfall': [6, 12, 8, 4],
'humidity': [55, 65, 60, 50],
'year': [2025, 2025, 2025, 2025],
'month': [7, 7, 7, 7],
'province': ['Punjab', 'Sindh', 'Khyber Pakhtunkhwa', 'Balochistan']
})
Scale features
X_new_scaled = scaler.transform(new_data_2025.drop(columns=['province']))
Predictions
new_data_2025['Pred_KNN'] = knn_classifier.predict(X_new_scaled)
new_data_2025['Pred_RF'] = rf_classifier.predict(X_new_scaled)
print(new_data_2025)
suspected_cases deaths temperature rainfall humidity year month \
0 12 1 26 6 55 2025 7
1 18 2 30 12 65 2025 7
2 15 1 28 8 60 2025 7
3 10 1 25 4 50 2025 7
province Pred_KNN Pred_RF
0 Punjab 0 0
1 Sindh 1 1
2 Khyber Pakhtunkhwa 0 0
3 Balochistan 0 0
Step 5 — Visualization¶
Prepare data for plotting KNN predictions
plot_knn = new_data_2025[['province', 'Pred_KNN']]
plt.figure(figsize=(8,6))
sns.barplot(data=plot_knn, x='province', y='Pred_KNN', color='skyblue')
plt.title('Predicted High Dengue Cases (KNN) for 2025 by Province', fontsize=14)
plt.ylabel('High Cases (1=Yes, 0=No)')
plt.xlabel('Province')
plt.ylim(0, 1.5)
plt.show()
Prepare data for plotting Random Forest predictions
plot_rf = new_data_2025[['province', 'Pred_RF']]
plt.figure(figsize=(8,6))
sns.barplot(data=plot_rf, x='province', y='Pred_RF', color='lightgreen')
plt.title('Predicted High Dengue Cases (Random Forest) for 2025 by Province', fontsize=14)
plt.ylabel('High Cases (1=Yes, 0=No)')
plt.xlabel('Province')
plt.ylim(0, 1.5)
plt.show()
8 — Interactive Scatter Plot¶
import plotly.express as px
1 Default template
DEFAULT_TEMPLATE = "plotly_dark"
1 visualising how temperature relates to confirmed dengue cases, with humidity shown as a colour scale¶
print("--- Task 01: Dengue Scatter Plot Description ---")
fig1 = px.scatter(
df,
x="temperature",
y="confirmed_cases",
color="humidity",
hover_data=df.columns,
title="Task 01: Temperature vs Dengue Confirmed Cases"
)
fig1.update_layout(template=DEFAULT_TEMPLATE)
fig1.show()
--- Task 01: Dengue Scatter Plot Description ---
2 visualising the effect of temperature, humidity, and rainfall together on confirmed dengue cases¶
print("--- Task 02: Dengue 3D Scatter Plot Description ---")
fig2 = px.scatter_3d(
df,
x="temperature",
y="humidity",
z="rainfall",
color="confirmed_cases",
hover_data=df.columns,
title="Task 02: Dengue Environmental Factors in 3D"
)
fig2.update_layout(template=DEFAULT_TEMPLATE)
fig2.show()
--- Task 02: Dengue 3D Scatter Plot Description ---
Explanation¶
The 3D scatter plot visualises the relationship between dengue cases and environmental factors like temperature, humidity, and rainfall. Each point represents a data record, with its position determined by these three factors, and the colour indicating the number of confirmed dengue cases. By hovering over a point, all details for that record can be seen. The plot helps identify patterns, showing that higher numbers of dengue cases tend to occur when temperature, humidity, and rainfall are all elevated, highlighting how these meteorological conditions contribute to outbreaks.
3 bubble plot visualising confirmed cases by province and district¶
print("--- Task 03: Dengue Bubble Plot Description ---")
fig3 = px.scatter(
df,
x="province",
y="district",
size="confirmed_cases",
color="rainfall",
hover_data=df.columns,
title="Task 03: Dengue Cases by Province and District"
)
fig3.update_layout(template=DEFAULT_TEMPLATE)
fig3.show()
--- Task 03: Dengue Bubble Plot Description ---
Explanation¶
The bubble plot shows the distribution of dengue cases across provinces and districts. Each bubble is one record, positioned by province on the X-axis and district on the Y-axis, so all records for a district overlap at the same point. The size of a bubble reflects the number of confirmed dengue cases in that record, so larger bubbles indicate more cases. The colour of the bubble represents rainfall, allowing visual comparison of rainfall levels and case counts. Hovering over a bubble displays all the data for that record, making it easy to explore specific details. Overall, the plot helps identify which districts and provinces are most affected and how rainfall might relate to dengue spread.
4 histogram showing how confirmed dengue cases are distributed across different months¶
print("--- Task 04: Dengue Histogram Description ---")
fig4 = px.histogram(
df,
x="confirmed_cases",
color="month",
marginal="box",
hover_data=df.columns,
title="Task 04: Monthly Distribution of Dengue Confirmed Cases"
)
fig4.update_layout(template=DEFAULT_TEMPLATE)
fig4.show()
--- Task 04: Dengue Histogram Description ---
Explanation¶
The histogram shows the distribution of confirmed dengue cases across different months. The X-axis represents the number of confirmed cases, while the bars are coloured by month, allowing you to see which months have higher or lower case counts. The box plot on the margin provides a summary of the overall distribution, showing the median, quartiles, and potential outliers. Hovering over each bar displays detailed data for that record. This plot helps identify seasonal trends, highlighting months with the highest dengue activity and showing how case numbers vary over time.
5 treemap visualising dengue cases hierarchically¶
print("--- Task 05: Dengue Treemap Description ---")
df_tree = df.copy()
df_tree["confirmed_cases"] = df_tree["confirmed_cases"].replace(0, 1)
fig5 = px.treemap(
df_tree,
path=["province", "district"],
values="confirmed_cases",
color="temperature",
hover_data=df.columns,
title="Task 05: Dengue Cases Treemap by Province and District"
)
fig5.update_layout(template=DEFAULT_TEMPLATE)
fig5.show()
--- Task 05: Dengue Treemap Description ---
Explanation¶
The treemap visualises dengue cases by province and district, showing the relative size of outbreaks across regions. Each rectangle represents a district, nested within its province, and the size of the rectangle corresponds to the number of confirmed dengue cases, so larger rectangles indicate more cases. The colour represents temperature, allowing you to see how higher or lower temperatures relate to case numbers. Hovering over a rectangle shows all the data for that district. This plot helps quickly identify the provinces and districts most affected by dengue and highlights possible links between temperature and outbreak intensity.
6 time series line chart showing how dengue confirmed cases change over time.¶
print("--- Task 06: Dengue Time Series Description ---")
fig6 = px.line(
df,
x="date",
y="confirmed_cases",
color="province",
hover_data=df.columns,
title="Task 06: Time Series of Dengue Confirmed Cases"
)
fig6.update_layout(template=DEFAULT_TEMPLATE)
fig6.show()
print("\nAll dengue visualisations generated successfully.")
--- Task 06: Dengue Time Series Description ---
All dengue visualisations generated successfully.
Explanation¶
The time series line plot shows how confirmed dengue cases change over time across different provinces. The X-axis represents the date, while the Y-axis shows the number of confirmed cases. Each line corresponds to a province, allowing comparison of trends between regions. Hovering over a point displays detailed information for that date and province. This plot helps track outbreaks, identify peaks in cases, and observe seasonal patterns, making it easier to understand when and where dengue cases rise or fall over time.
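Because the dataset holds one row per district per date, passing it to `px.line` unaggregated makes each province's line pass through several district values on the same date. One option, sketched here on a made-up four-row frame, is to sum confirmed cases per province and date before plotting; `toy_cases` and `daily_by_province` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration: two Punjab districts share the first date.
toy_cases = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-01-01", "2016-01-02"]),
    "province": ["Punjab", "Punjab", "Sindh", "Punjab"],
    "district": ["Lahore", "Multan", "Karachi", "Lahore"],
    "confirmed_cases": [1, 1, 3, 4],
})

# One row per (date, province), so each province line has a single value per date.
daily_by_province = (
    toy_cases.groupby(["date", "province"], as_index=False)["confirmed_cases"].sum()
)
print(daily_by_province)
```

The aggregated frame can then be passed to `px.line` with `x="date"`, `y="confirmed_cases"` and `color="province"`, as in the cell above.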
9 — Pakistan Maps¶
from IPython.display import Image, display
display(Image(filename="PAK1.png"))
print("PAK1")
display(Image(filename="PAK.png"))
print("PAK")
PAK1
PAK
SUMMARY OF LAB PROJECT¶
This lab project provides a comprehensive analysis of dengue cases in Pakistan using a dataset spanning multiple years and provinces. The dataset was first cleaned and preprocessed by fixing date formats, converting numeric columns, handling missing values, and standardising text fields. Exploratory data analysis revealed the distribution of suspected and confirmed cases, the percentage of deaths by province, monthly and yearly trends, and correlations between environmental factors like temperature, humidity, and rainfall with dengue incidence. Regression models, including Linear Regression and Random Forest Regressor, were trained to predict confirmed cases, and predictions for 2026 were generated under hypothetical environmental conditions. Classification models, KNN and Random Forest Classifier, were implemented to identify high-risk dengue cases, with predictions for 2025 across provinces. Visualisations included histograms, boxplots, line charts, 3D scatter plots, bubble plots, treemaps, and interactive scatter plots to illustrate trends, distributions, and relationships between variables. Finally, a choropleth map of Pakistan displayed dengue cases by province, integrating geographical context. Overall, the project highlights the role of environmental factors in dengue outbreaks, identifies regions with higher risk, and demonstrates predictive modelling and data visualisation techniques for public health insights.